rewrite: Pythonic, ocrd v3, utilise page-level annotation #28

Open
wants to merge 29 commits into
base: master

bertsky
Contributor

@bertsky bertsky commented Feb 17, 2025

First attempt at a full OCR-D processor for this. It builds on core 3.0, which brings error handling and page parallelism. (That requires Python instead of bashlib, but the Python version is already much faster even when run sequentially.)

  • To deal with the OCR-D annotation (cropping and deskewing, possibly also binarization and denoising, or even text-image separation (clipped)), it does not suffice to pass the original image and the PAGE file name to the converter. Instead, one needs to extract or generate the derived image for the page and then transform all coordinates in the PAGE file accordingly. This is borrowed from ocrd-segment-replace-original.
  • Ships PRImA PDF converter as package data
  • Can cope with the METS Server: for that, get_metadata (with its XML tree operations) needs to read from a filesystem copy of the METS instead of the ClientSideOcrdMets
  • I kept the multipagepdf code, but separated the functions
  • negative2zero was too simplistic. In OCR-D, we have the PageValidator to detect all kinds of coordinate invalidities and inconsistencies. I borrowed from ocrd-segment-repair for the actual repairs. (This is debatable, though: arguably repair should remain a separate step. Also, I had to copy and paste a lot of polygon handling code.)
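
The coordinate handling in the first and last points can be illustrated with a minimal, dependency-free sketch. It assumes the derived image was produced by cropping at a known offset and then rotating about the centre of the crop by the deskewing angle, without resizing. All function and parameter names are hypothetical; this is not the code from this PR, from ocrd-segment-replace-original or from ocrd-segment-repair, and the sign convention of the angle is an assumption.

```python
import math


def to_derived_coords(points, crop_offset, angle_deg, crop_size):
    """Map (x, y) points from original-image space into a cropped, rotated image.

    Assumptions (illustrative only): the crop starts at crop_offset=(x0, y0)
    and has crop_size=(width, height); the crop is then rotated by angle_deg
    about its centre and keeps its size.
    """
    x0, y0 = crop_offset
    width, height = crop_size
    cx, cy = width / 2.0, height / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    result = []
    for x, y in points:
        # 1. shift into crop space, measured from the crop centre
        dx, dy = x - x0 - cx, y - y0 - cy
        # 2. apply the standard 2D rotation matrix (image axes, y pointing down)
        rx = cos_a * dx - sin_a * dy
        ry = sin_a * dx + cos_a * dy
        # 3. shift back so the origin is the top-left corner of the derived image
        result.append((rx + cx, ry + cy))
    return result


def clip_to_image(points, size):
    """Clamp every point into [0, width] x [0, height]."""
    width, height = size
    return [(min(max(x, 0), width), min(max(y, 0), height)) for x, y in points]


outline = [(110, 220), (140, 220), (140, 260)]
moved = to_derived_coords(outline, crop_offset=(100, 200), angle_deg=0, crop_size=(50, 50))
print(clip_to_image(moved, (50, 50)))
```

Clamping alone can collapse a polygon into a line and does nothing about self-intersections, which illustrates why a simple negative-to-zero replacement may fall short and why validation plus repair is the more robust route.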

However, there still seems to be a problem with the coordinates of the outlines...

Next I'll add further improvements:

  • filtering / selection of image features (e.g. binarized or not)
  • font param as installable resmgr resources
  • further metadata from MODS
  • logical structMap to bookmarks (outline labels)
  • option to write page-wise PDFs into temporary storage only
  • basic tests and CI

This depends on OCR-D/core#1305.

@JKamlah
Member

JKamlah commented Feb 18, 2025

Thank you very much @bertsky,
for all your efforts to keep the module up-to-date. If you are satisfied with your updates and the module is compatible with the latest OCR-D standards, I will gladly accept the PR.

…leGrp for images, no parsing/validation/repair)
…add pytest option --workspace for subsets, determine input fileGrp automatically, download and process up to 4 random pages only, test PAGE2PDF and ALTO2PDF, depending on whether PAGE or ALTO is in the workspace
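
A test setup along the lines of the commit message above (a --workspace option and a cap of 4 random pages) might look roughly like the following conftest.py fragment. This is only a sketch under assumptions: pytest_addoption is the regular pytest hook for registering command-line options, but the helper pick_pages, its parameters and the way the option values are interpreted are hypothetical and not taken from this PR.

```python
import random


def pytest_addoption(parser):
    # pytest looks this hook up in conftest.py; it registers a repeatable option
    parser.addoption(
        "--workspace",
        action="append",
        default=[],
        help="name of a test workspace to run on (may be repeated)",
    )


def pick_pages(page_ids, limit=4, seed=None):
    """Return at most `limit` page IDs, chosen at random but kept in document order."""
    if len(page_ids) <= limit:
        return list(page_ids)
    rng = random.Random(seed)
    # sample positions rather than values so duplicate IDs cannot confuse the filter
    chosen = set(rng.sample(range(len(page_ids)), limit))
    return [page_id for index, page_id in enumerate(page_ids) if index in chosen]
```

Passing a fixed seed makes the random subset reproducible, which can help when a CI failure needs to be replayed locally.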
@bertsky
Contributor Author

bertsky commented Feb 22, 2025

next up:

  • update Readme
  • add CI (basically make test PYTEST_ARGS="--workspace all -vv")

Note that this now also supports things like ocrd-altotopdf -I FULLTEXT,ORIGINAL -O DOWNLOAD -P multipage FULLDOWNLOAD. Both processors now also add a table of contents.

I wonder whether the multipage file ID should really have to be specified manually. It may be difficult to come up with a non-conflicting name in a scripted setting. Instead, the tool itself could try mets.unique_identifier or identifiers from MODS and convert them to a safe XML ID. (The multipage parameter would then just become a boolean.) What do you think?
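
To make the "safe XML ID" idea concrete, here is a minimal sketch of such a conversion. It is an illustration, not code from this PR: the function name to_xml_id and the id_ prefix are hypothetical, the sample identifier is made up, and the character set is deliberately restricted to ASCII even though XML names allow more.

```python
import re


def to_xml_id(identifier, prefix="id_"):
    """Turn an arbitrary identifier (e.g. a URN) into an ASCII-only XML ID."""
    # XML IDs must not contain colons or whitespace, so replace everything
    # outside a conservative ASCII subset with an underscore
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", identifier.strip())
    # an XML ID must start with a letter or underscore, not a digit, "-" or "."
    if not re.match(r"[A-Za-z_]", safe):
        safe = prefix + safe
    return safe


print(to_xml_id("urn:nbn:de:example-1234"))
```

Different identifiers can map to the same result (":" and " " both become "_"), so the generated ID would still need to be checked against the IDs already present in the METS.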

@bertsky bertsky marked this pull request as ready for review February 22, 2025 03:08
@JKamlah
Member

JKamlah commented Feb 25, 2025

Thanks @bertsky for all your work. We still have to decide how we want to proceed with the repository in general. I hope that we will have made a decision by the end of the week.

@bertsky
Contributor Author

bertsky commented Mar 4, 2025

> next up:

done. I have also added continuous deployment. You would need to add the following in the repo settings to make everything work:

  • add an environment secret DOCKERHUB_USERNAME
  • add an environment secret DOCKERHUB_PASSWORD
  • log in to PyPI.org and create a security token, copy...
  • ...and paste as new environment secret PYPI_TOKEN

> We still have to decide how we want to proceed with the repository in general.

What do you mean?

@stweil
Member

stweil commented Mar 4, 2025

> We still have to decide how we want to proceed with the repository in general.

> What do you mean?

We are talking with the OCR-D coordination team about moving this repository to https://github.com/OCR-D/.

@bertsky
Contributor Author

bertsky commented Mar 4, 2025

> We are talking with the OCR-D coordination team about moving this repository to https://github.com/OCR-D/.

I see. If and when that's certain, please let me know so I can adapt the upstream URLs (packaging, CI+CD) before this is merged.

@JKamlah
Member

JKamlah commented Mar 5, 2025

Edited: We have decided to move the repository to https://github.com/OCR-D/. @kba should now be able to carry out the transfer at the right time.

@bertsky
Contributor Author

bertsky commented Mar 5, 2025

> We have now decided to move the repository to https://github.com/OCR-D/ in the next few days. Do you need more time to adapt the code?

done: 7021614

3 participants